CBAS: Context Based Arabic Stemmer
Abstract
Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are used for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivational languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots could be extracted from the same word. Distributional semantics, on the other hand, is a powerful co-occurrence model that captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, with at least a 9.4% improvement over other stemmers.
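The abstract does not give the SPMI formula or the disambiguation procedure, so the following is only a minimal sketch of the general idea: build co-occurrence counts, compute a smoothed PMI score, and prefer the candidate root that associates most strongly with the surrounding context. The add-k smoothing, the function names (`cooccurrence_counts`, `smoothed_pmi`, `pick_root`), and the toy corpus of transliterated roots are all assumptions made for illustration, not the authors' method or data.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count ordered (word, context) pairs within a symmetric window."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1
    return pair_counts


def smoothed_pmi(pair_counts, k=1.0):
    """Return pmi(word, context) based on add-k smoothed counts.

    Assumption: add-k (Laplace-style) smoothing over the full
    word-by-context matrix; the paper may smooth differently.
    """
    word_totals = Counter()
    context_totals = Counter()
    for (w, c), n in pair_counts.items():
        word_totals[w] += n
        context_totals[c] += n
    n_words = len(word_totals)
    n_contexts = len(context_totals)
    total = sum(pair_counts.values()) + k * n_words * n_contexts

    def pmi(word, context):
        joint = (pair_counts.get((word, context), 0) + k) / total
        p_word = (word_totals.get(word, 0) + k * n_contexts) / total
        p_context = (context_totals.get(context, 0) + k * n_words) / total
        return math.log2(joint / (p_word * p_context))

    return pmi


def pick_root(candidate_roots, context_roots, pmi):
    """Choose the candidate with the highest summed PMI against the context."""
    return max(candidate_roots,
               key=lambda root: sum(pmi(root, c) for c in context_roots))


# Toy corpus of transliterated roots (invented for illustration only).
corpus = [
    ["ktb", "qlm", "drs"],
    ["ktb", "drs", "qlm"],
    ["rkb", "syr", "jml"],
]
pair_counts = cooccurrence_counts(corpus)
pmi = smoothed_pmi(pair_counts)

# An ambiguous word with two candidate roots, seen near "qlm" and "drs".
best = pick_root(["ktb", "rkb"], ["qlm", "drs"], pmi)
print(best)  # prints "ktb"
```

Summing PMI over the context is one simple way to combine evidence; a real system would likely need a much larger corpus and a more careful scoring rule.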
Similar resources
Nahla A Belal CBAS: Context Based
Nahla A Belal. CBAS: Context Based Arabic Stemmer. Arabic morphology encapsulates many valuable features such as word's root. Arabic roots are being utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivative languages such as Arabic. However, stemming is faced with t...
The Enhancement of Arabic Stemming by Using Light Stemming and Dictionary-Based Stemming
Word stemming is one of the most important factors that affect the performance of many natural language processing applications such as part of speech tagging, syntactic parsing, machine translation system and information retrieval systems. Computational stemming is an urgent problem for Arabic Natural Language Processing, because Arabic is a highly inflected language. The existing stemmers hav...
Unsupervised Learning of Arabic Stemming Using a Parallel Corpus
This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation and it uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources. No parallel text is needed after the training phase. Monolingual, unannotated text can be used to further improve the stemmer by ...
New stemming for Arabic text classification using feature selection and decision trees
In this paper we conduct a comparative study between two stemming algorithms, the Khoja stemmer and our new stemmer, for Arabic text classification (categorization), using Chi-square statistics for feature selection and focusing on a decision tree classifier. Evaluation used a corpus that consists of 5070 documents independently classified into six categories: sport, entertainment, business, Middle East...
Enhancing Retrieval Effectiveness of Diacritisized Arabic Passages Using Stemmer and Thesaurus
In this paper we discuss the enhancement of Arabic passage retrieval for both diacritisized and nondiacritisized text. Most previous work suggested that retrieval start with pre-processing the Arabic text to remove the diacritical marks (short vowels) to unify the text. In most cases, this process causes considerable ambiguity at the word level in the absence of context. However, searching for ...
Journal title:
- CoRR
Volume abs/1611.00027, issue -
Pages -
Publication date 2015